ABSTRACT

Classification and regression tree methodology is an essential tool in statistics and machine learning. This research
made several advances in the area and implemented them in the GUIDE computer software. The major
contributions  are  (i)  a  new  technique  to  deal  with  missing  data  values  that  allows  all  the  information,  including  whether  or  not  an  observation 
is  missing,  to  be  used  for  tree  construction  and  prediction,  (ii)  a  new  method  of  scoring  the  importance  of  variables  that  can  be  used  to 
objectively  reduce  the  number  of  variables  for  prediction  modeling,  (iii)  a  new  approach  to  building  regression  models  for  data  with 
multidimensional  or  longitudinal  response  variables  that  does  not  require  any  model  assumptions,  and  (iv)  several  new  techniques  for 
identifying  subgroups  of  the  data  for  enhanced  differential  treatment  effects. 

(1) Handling of missing values and comparison of techniques.


One  of  the  most  difficult  but  important  problems  in  statistics  is  how  to  construct  a  prediction  model  that  can  deal  effectively  with 
missing  data  values.  This  problem  has  two  parts:  when  there  are  missing  values  in  the  data  used  to  construct  the  model,  and 
when  there  are  no  missing  values  in  the  training  data,  but  missing  values  occur  in  the  data  used  for  future  prediction.  Many 
solutions  have  been  proposed,  but  there  are  few  comparative  studies.  During  the  period  of  this  grant,  an  effective  solution  was 
found  for  use  in  the  GUIDE  classification  and  regression  tree  algorithm.  If  missing  values  occur  in  the  training  data,  GUIDE 
provides  a  "missing"  category  to  hold  missing  values  when  it  performs  contingency  table  chi-squared  tests  for  split  variable 
selection.  Besides  enabling  GUIDE  to  use  all  the  data  for  variable  selection  at  each  node  of  the  tree,  this  also  makes  the 
algorithm  sensitive  to  informative  missingness,  where  data  are  not  missing  purely  by  chance.  After  a  split  variable  is  selected  at 
a  node,  GUIDE  considers  three  types  of  split  on  that  variable  for  the  node.  The  first  type  is  to  put  all  missing  values  in  one 
branch  of  the  split  and  all  nonmissing  values  in  the  other  branch.  This  allows  GUIDE  to  detect  missingness  dependent  on  the 
response  variable.  The  second  type  of  split  searches  over  the  nonmissing  data  values  to  find  the  optimal  split  point,  c,  such  that 
all  nonmissing  values  less  than  c  and  all  missing  values  are  sent  to  the  left  branch.  The  third  type  is  similar  to  the  second  type, 
except that all missing values are sent to the right branch instead of the left branch. If the split variable has no missing values
in the training data, missing values in future data to be predicted are sent to the branch with the larger number of training cases. This
strategy permits GUIDE to use all the data, all the time. Preliminary results from a large-scale empirical comparative study of
GUIDE against other tree and non-tree methods indicate that this strategy is, on average, better than other standard methods
such as mean/mode imputation and complete-case analysis. Details of the method for classification are reported in Loh (2009).
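
The three split types described above can be illustrated with a minimal sketch. It is not the GUIDE implementation: the function names (best_split_with_missing, route), the use of None to mark a missing value, the sum-of-squares cost, and the convention that missing values go to the left branch in the first split type are all assumptions made for illustration.

```python
from math import inf

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split_with_missing(xs, ys):
    """Search the three split types for one predictor.

    xs: predictor values, with None marking a missing value (assumed encoding).
    ys: numeric responses.
    Returns (cost, kind, c); c is None for the missing-vs-observed split.
    """
    miss = [y for x, y in zip(xs, ys) if x is None]
    obs = sorted((x, y) for x, y in zip(xs, ys) if x is not None)
    best = (inf, None, None)
    # Type 1: all missing values in one branch, all nonmissing in the other.
    if miss and obs:
        best = (sse(miss) + sse([y for _, y in obs]), "missing_vs_observed", None)
    # Candidate split points: distinct observed values except the smallest,
    # so that "x < c" never leaves the left branch without observed cases.
    cuts = sorted({x for x, _ in obs})[1:]
    for c in cuts:
        left = [y for x, y in obs if x < c]
        right = [y for x, y in obs if x >= c]
        # Type 2 sends missing values left; type 3 sends them right.
        for kind, l, r in (("missing_left", left + miss, right),
                           ("missing_right", left, right + miss)):
            cost = sse(l) + sse(r)
            if cost < best[0]:
                best = (cost, kind, c)
    return best

def route(x, kind, c, had_missing, n_left, n_right):
    """Send one future case to a branch of a fitted split.

    had_missing: whether the split variable had missing training values.
    n_left, n_right: training counts in each branch.
    """
    if x is None:
        if not had_missing:
            # No missing values seen in training: follow the larger branch
            # (ties go left, an arbitrary choice in this sketch).
            return "left" if n_left >= n_right else "right"
        return "right" if kind == "missing_right" else "left"
    if kind == "missing_vs_observed":
        return "right"
    return "left" if x < c else "right"
```

For example, with xs = [1, 2, None, 8, 9, None] and ys = [1.0, 1.2, 5.0, 5.1, 4.9, 5.2], the missing cases have responses close to those of the large x values, so the search selects the third split type at c = 8.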

(2)  Variable  importance  scores  and  thresholds  for  large  p  small  n  problems. 

Data  sets  with  more  variables  than  observations  have  become  increasingly  frequent.  Because  many  statistical  methods  require 
the  number  of  observations  to  exceed  the  number  of  variables,  there  is  an  urgent  need  to  find  effective  solutions  for  variable 
selection. For regression, the LASSO and similar solutions select variables by fitting a linear regression model to the data with an
L1 penalty, or a combination of L1 and L2 penalties, on the coefficients, which forces some coefficients to be zero and hence declares the others to be
important.  A  major  weakness  of  this  approach  is  the  assumption  of  an  underlying  regression  model.  Another  weakness  is  that 
missing  data  values  need  to  be  estimated  in  advance.  In  the  latter  situation,  the  performance  of  the  method  depends  critically 
on  the  missing  value  estimation  method.  During  the  period  of  this  grant,  an  importance  scoring  method  was  implemented  in  the 
GUIDE  algorithm.  The  key  idea  is  to  employ  the  chi-squared  statistics  that  are  already  computed  by  GUIDE  during  model 
construction.  By  arguing  that  the  chi-squared  statistics  are  approximately  mutually  independent  in  the  "null"  case  where  all 
predictor  variables  are  independent  of  the  response  variable,  the  statistics  may  be  combined  over  the  nodes  of  the  tree  to  form 
an  overall  measure  of  importance  for  each  variable.  Furthermore,  by  approximating  these  scores  with  a  single  scaled  chi- 
squared  distribution,  a  threshold  for  separating  the  important  from  unimportant  variables  can  be  obtained.  The  availability  of  a 
threshold  is  a  major  accomplishment  because  such  thresholds  are  not  generally  provided  by  previous  importance  scoring 
methods, which significantly limits their usefulness. The technique was first reported in Loh (2012).
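
The idea of combining per-node chi-squared statistics and thresholding them with a scaled chi-squared distribution can be sketched as follows. This is only an illustration under stated assumptions, not the construction in Loh (2012): the weighting of each node by the square root of its sample size, the fitting of the scale and degrees of freedom by matching the mean and variance of scores obtained under the null case, the Wilson-Hilferty quantile approximation, and all function names are choices made for this sketch.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def pearson_chi2(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def importance(node_tables):
    """Combine the statistics of one variable over the nodes of a tree.

    node_tables: one variable-by-response contingency table per node.
    Weighting by sqrt(node size) is an assumption of this sketch.
    """
    return sum(sqrt(sum(map(sum, t))) * pearson_chi2(t) for t in node_tables)

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * sqrt(2 / (9 * df))) ** 3

def threshold(null_scores, p=0.95):
    """Fit a * chi2(df) to null-case scores by moment matching.

    If S ~ a * chi2(df), then E[S] = a * df and Var[S] = 2 * a**2 * df.
    """
    m, v = mean(null_scores), variance(null_scores)
    a = v / (2 * m)       # scale factor
    df = 2 * m * m / v    # effective degrees of freedom
    return a * chi2_quantile(p, df)
```

A variable whose importance score exceeds the value returned by threshold would be declared important at the chosen level. The null-case scores could come, for instance, from refitting with a permuted response, which is again an assumption of this sketch.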

(3)  Multiresponse  and  longitudinal  data. 

Although  there  exist  a  few  regression  tree  algorithms  for  longitudinal  response  variables,  all  of  them  are  afflicted  with  variable 
selection  bias,  wherein  variables  that  permit  more  splits  are  more  likely  to  be  selected,  even  when  all  the  variables  are 
independent  of  the  response.  Further,  because  these  methods  fit  a  linear  mixed  model  to  the  data  in  each  node,  they  are 
unstable  and  can  be  highly  compute-intensive  if  the  data  set  is  large.  During  the  period  of  this  grant,  a  completely  new  approach 
to  the  problem  was  implemented  in  GUIDE.  The  key  idea  is  to  treat  each  longitudinal  series  as  a  random  curve  and  then  use 
the  predictor  variables  to  split  the  data  by  grouping  the  curves  according  to  their  shapes.  As  a  result,  no  model  assumptions  are 
required  and  the  observation  time  points  and  their  number  may  be  fixed  or  random.  The  same  technique  can  be  applied  to 
multiresponse  data  where  each  subject  is  observed  on  two  or  more  response  variables.  Details  of  the  method  are  reported  in 
Loh and Zheng (2013).
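
One way to picture "grouping the curves according to their shapes" is sketched below. It is a simplified illustration, not the published algorithm: it assumes that every curve is observed on a common time grid, that shape is summarized by whether a curve lies above or below the node's mean curve in each half of the grid, and that a categorical predictor is then cross-tabulated against these shape labels. All function names are hypothetical.

```python
from collections import Counter

def mean_curve(curves):
    """Pointwise mean of equal-length curves observed on a common time grid."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]

def shape_label(curve, center):
    """Label a curve by the sign of its summed residual in each half of the grid."""
    residuals = [y - m for y, m in zip(curve, center)]
    half = len(residuals) // 2

    def sign(segment):
        return "+" if sum(segment) > 0 else "-"

    return sign(residuals[:half]) + sign(residuals[half:])

def shape_table(curves, groups):
    """Cross-tabulate shape labels against a categorical predictor.

    The resulting counts form a contingency table on which a chi-squared
    test of association could be run to rank candidate split variables.
    """
    center = mean_curve(curves)
    return Counter((g, shape_label(c, center)) for c, g in zip(curves, groups))
```

For example, two rising curves in group "a" and two falling curves in group "b" give the labels "-+" and "+-" respectively, so the table shows a perfect association between the predictor and curve shape. Curves observed at random time points would need to be smoothed or interpolated first, which this sketch does not attempt.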

(4)  Subgroup  identification  for  censored  and  uncensored  responses. 

Diseases such as cancer are hard to treat because not all patients respond equally to any given drug. As a result, the
average  efficacy  of  a  new  drug  for  all  patients  tends  to  be  low,  making  it  difficult  to  get  approval  from  regulatory  agencies. 
Current  industry  thinking  is  to  identify  subgroups  of  the  patient  population,  defined  in  terms  of  characteristics  such  as  gender, 
family  history  of  disease,  etc.,  as  well  as  genetic  traits,  for  which  a  drug  has  an  enhanced  effect.  Because  a  regression  tree 
naturally  divides  the  sample  and  hence  the  population  into  subgroups  of  this  type,  there  have  been  several  attempts  to  use  this 
approach to solve the problem. During this reporting period, the GUIDE algorithm was extended to provide several alternative
solutions, depending on whether the subgroups are to be defined in terms of prognostic or predictive factors. Results
based  on  real  and  simulated  data  sets  indicate  that  the  GUIDE  solutions  are  superior  to  the  previous  ones  in  terms  of  (i) 
accuracy  in  identification,  (ii)  computational  speed,  (iii)  bias  in  selection  of  variables  used  to  split  the  nodes,  and  (iv)  extensibility 
to  comparisons  of  more  than  two  treatments  and  to  censored  response  variables.  A  manuscript  of  the  results  is  being  prepared 
for  publication. 